Lets make sure we have read the gapminder data into R and have the relevant packages loaded.
library(readr)
gapminder <- read_csv("raw_data/gapminder.csv")
library(dplyr)
library(ggplot2)
Now make a scatter plot of gdp versus life expectancy as we did in the previous session.
p <- ggplot(gapminder, aes(x = gdpPercap, y=lifeExp,col=continent)) + geom_point()
One very useful feature of ggplot2 is faceting. This allows you to produce plots subset by variables in your data. In the scatter plot above, it was quite difficult to see if the relationship between gdp and life expectancy was the same for each continent. To overcome this, we would like a see a separate plot for each continent.
To facet our data into multiple plots we can use the facet_wrap (1 variable) or facet_grid (2 variables) functions and specify the variable(s) we split by.
p + facet_wrap(~continent)
The facet_grid function will create a grid-like plot with one variable on the x-axis and another on the y-axis.
p + facet_grid(continent~year)
The previous plot was a bit messy as it contained all combinations of year and continent. Let’s suppose we want our analysis to be a bit more focused and disregard countries in Oceania (as there are only 2 in our dataset) and maybe years between 1997 and 2002.
We should know how to restrict the rows from the gapminder dataset using the filter function. Instead of filtering the data, creating a new data frame and constructing the data frame from these new data we can use the%>% operator to create the data frame on the fly and pass directly to ggplot. Thus we don’t have to save a new data frame or alter the original data.
filter(gapminder, continent!="Oceania", year %in% c(1997,2002,2007)) %>%
ggplot(aes(x = gdpPercap, y=lifeExp,col=continent)) + geom_point() + facet_grid(continent~year)
The summarise function can take any R function that takes a vector of values (i.e. a column from a data frame) and returns a single value. Some of the more useful functions include:
min minimum valuemax maximum valuesum sum of valuesmean mean valuesd standard deviationmedian median valueIQR the interquartile rangen_distinct the number of distinct valuesn the number of observations (Note: this is a special function that doesn’t take a vector argument, i.e. column)library(dplyr)
summarise(gapminder, min(lifeExp), max(gdpPercap), mean(pop))
It is also possible to summarise using a function that takes more than one value, i.e. from multiple columns. For example, we could compute the correlation between year and life expectancy. Here we also assign names to the table that is produced.
gapminder %>%
summarise(MinLifeExpectancy = min(lifeExp),
MaximumGDP = max(gdpPercap),
AveragePop = mean(pop),
Correlation = cor(year, lifeExp))
However, it is not particularly useful to calculate such values from the entire table as we have different continents and years. The group_by function allows us to split the table into different categories, and compute summary statistics for each year (for example).
gapminder %>%
group_by(year) %>%
summarise(MinLifeExpectancy = min(lifeExp),
MaximumGDP = max(gdpPercap),
AveragePop = mean(pop))
If we want to summarise several columns we can use the convenient summarise_all function. However, this will return NA values for columns that do not contain numeric values.
gapminder %>%
group_by(continent,year) %>%
summarise_all(mean)
The nice thing about summarise is that it can followed up by any of the other dplyr verbs that we have met so far (select, filter, arrange..etc). As the country column of the previous output containing missing values we can exclude it from further processing.
gapminder %>%
group_by(continent,year) %>%
summarise_all(mean) %>%
select(-country)
Returning to the correlation between life expectancy and year, we can summarise as follows:-
gapminder %>%
group_by(country) %>%
summarise(Correlation = cor(year , lifeExp))
We can then arrange the table by the correlation to see which countries have the lowest correlation
gapminder %>%
group_by(country) %>%
summarise(Correlation = cor(year , lifeExp)) %>%
arrange(Correlation)
We can filter the results to find observations of interest
gapminder %>%
group_by(country) %>%
summarise(Correlation = cor(year , lifeExp)) %>%
filter(Correlation < 0)
The countries we identify could then be used as the basis for a plot.
library(ggplot2)
filter(gapminder, country %in% c("Rwanda","Zambia","Zimbabwe")) %>%
ggplot(aes(x=year, y=lifeExp,col=country)) + geom_line()
gapminder data in an appropriate manner to produce a plot to show the change in average gdpPercap for each continent over time.geom_col function to create the bar plotIn many real life situations, data are spread across multiple tables or spreadsheets. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources. It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a common column, for example, an attribute that uniquely identifies the subject.
dplyr provides a set of join functions for combining two data frames based on matches within specified columns. For those familiar with such SQL, these operations are very similar to carrying out join operations between tables in a relational database.
As a toy example, lets consider two data frames that some results of testing whether genes A, B and C are significant in our study (gene expression, mutations, etc.)
gene_results <- data.frame(Name=LETTERS[1:3], pvalue = c(0.001, 0.1,0.01))
gene_results
We might also have a data frame containing more data about the genes; such as which chromosome they are located on. As part of our data interpretation we might need to know where in the genome the genes are located.
gene_anno <- data.frame(Name = c("A","B","D"), chromosome=c(1,1,3))
gene_anno
There are various ways in which we can join these two tables together. We will first consider the case of a “left join”.
Animated gif by Garrick Aden-Buie
left_join returns all rows from the first data frame regardless of whether there is a match in the second data frame. Rows with no match are included in the resulting data frame but have NA values in the additional columns coming from the second data frame.
Animations to illustrate other types of join are available at https://github.com/gadenbuie/tidy-animated-verbs
left_join(gene_results, gene_anno)
Joining, by = "Name"
right_join is similar but returns all rows from the second data frame that have a match with rows in the first data frame based on the specified column.
right_join(gene_results, gene_anno)
Joining, by = "Name"
inner_join only returns those rows where matches could be made
inner_join(gene_results, gene_anno)
Joining, by = "Name"
We have introduced a few of the essential packages from the R tidyverse that can help with data manipulation and visualisation.
Hopefully you will feel more confident about importing your data into R and producing some useful visualisations. You will probably have questions regarding the analysis of your own data. Some good starting points to get help are listed below.
To finish the workshop we will look at the analysis of some relevant data that we can import into R and analyse with the tools from the workshop.
Data for global COVID-19 cases are available online from CSSE at Johns Hopkins University on their github repository.
github is an excellent way of making your code and analysis available for others to reuse and share. Private repositories with restricted access are also available. Here is a useful beginners guide.
R is capable of downloading files to our own machine so we can analyse them. We need to know the URL (for the COVID data we can find this from github, or use the address below) and can specify what to call the file when it is downloaded.
download.file("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",destfile = "time_series_covid19_confirmed_global.csv")
trying URL 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
Content type 'text/plain; charset=utf-8' length 494432 bytes (482 KB)
downloaded 482 KB
We can use the read_csv function as before to import the data and take a look. We can see the basic structure of the data is one row for each country / region and columns for cases on each day.
covid <- read_csv("time_series_covid19_confirmed_global.csv")
Parsed with column specification:
cols(
.default = col_double(),
`Province/State` = col_character(),
`Country/Region` = col_character()
)
See spec(...) for full column specifications.
covid
We can potentially join these data to gapminder, but it would be beneficial to have one column name in common between both files. We can rename the Country/Region column of our new data frame to match gapminder.
covid <- read_csv("time_series_covid19_confirmed_global.csv") %>%
rename(country = `Country/Region`)
Parsed with column specification:
cols(
.default = col_double(),
`Province/State` = col_character(),
`Country/Region` = col_character()
)
See spec(...) for full column specifications.
covid
Much of the analysis of this dataset has looked at trends over time (e.g. increasing /decreasing case numbers, comparing trajectories). As we know by now, the ggplot2 package allows us to map columns (variables) in our dataset to aspects of the plot.
In other words, we would expect to create plots by writing code such as:-
ggplot(covid, aes(x = Date, y =...)) + ...
Unfortunately such plots are not possible with the data in it’s current format. Counts for each date are containing in a different column. What we require is a column to indicate the date, and the corresponding count in the next column. Such data arrangements are known as long data; whereas we have wide data. Fortunately we can convert between the two using the tidyr package (also part of tidyverse).
## install tidyr if you don't already have it
install.packages("tidyr")
For more information on tidy data, and how to convert between long and wide data, see
library(tidyr)
covid <- read_csv("time_series_covid19_confirmed_global.csv") %>%
rename(country = `Country/Region`) %>%
pivot_longer(5:last_col(),names_to="Date", values_to="Cases")
Parsed with column specification:
cols(
.default = col_double(),
`Province/State` = col_character(),
`Country/Region` = col_character()
)
See spec(...) for full column specifications.
covid
Another point to note is that the dates are not in an internationally recognised format, which could cause a problem for some visualisations that rely on date order. We can fix by explicitly converting to YYYY-MM-DD format.
covid <- read_csv("time_series_covid19_confirmed_global.csv") %>%
rename(country = `Country/Region`) %>%
pivot_longer(5:last_col(),names_to="Date", values_to="Cases") %>%
mutate(Date=as.Date(Date,"%m/%d/%y"))
Parsed with column specification:
cols(
.default = col_double(),
`Province/State` = col_character(),
`Country/Region` = col_character()
)
See spec(...) for full column specifications.
covid
The number of new cases per-day can also be added to the table by using the lag function from dplyr. However, we need to stop negative values appearing by using the pmax function to make 0 the lowest possible value.
covid <- read_csv("time_series_covid19_confirmed_global.csv") %>%
rename(country = `Country/Region`) %>%
pivot_longer(5:last_col(),names_to="Date", values_to="Cases") %>%
mutate(Date=as.Date(Date,"%m/%d/%y")) %>%
group_by(country) %>%
mutate(DailyCases = Cases - lag(Cases,default = 0)) %>% mutate(DailyCases = pmax(DailyCases,0))
Parsed with column specification:
cols(
.default = col_double(),
`Province/State` = col_character(),
`Country/Region` = col_character()
)
See spec(...) for full column specifications.
covid
What plots and summaries can you make from these data?